In [None]:
# Install required packages
!pip install --upgrade --quiet "natural-pdf[ai]"

print('✓ Packages installed!')

**Slides:** [slides.pdf](./slides.pdf)

# Multi-page flows

*Sometimes* you have data that flows over multiple columns, or pages, or just... isn't arranged in a "normal" top-to-bottom way.

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/multicolumn.pdf")
page = pdf.pages[0]
page.show()

Natural PDF deals with these through [reflowing pages](https://jsoma.github.io/natural-pdf/reflowing-pages/), where you grab specific regions of a page and then paste them back together either vertically or horizontally.

In this example we're splitting the page into three columns.

In [None]:
left = page.region(left=0, right=page.width/3, top=0, bottom=page.height)
mid = page.region(left=page.width/3, right=page.width/3*2, top=0, bottom=page.height)
right = page.region(left=page.width/3*2, right=page.width, top=0, bottom=page.height)
mid.show()
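
If you need a different number of columns, the arithmetic above generalizes. Here is a minimal sketch, assuming the columns are all the same width; `column_bounds` is a made-up helper for illustration, not part of Natural PDF.

```python
def column_bounds(page_width, n_columns):
    """Return (left, right) x-coordinates for n equal-width columns."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    col_width = page_width / n_columns
    return [(i * col_width, (i + 1) * col_width) for i in range(n_columns)]


# 612 points is the width of a US Letter page, used here only as an example
bounds = column_bounds(612, 3)
print(bounds)
```

With a real page you could then build the regions with the same `page.region(...)` call as above, for example `[page.region(left=l, right=r, top=0, bottom=page.height) for l, r in column_bounds(page.width, 3)]`.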

Now let's **stack them on top of each other**.

In [None]:
from natural_pdf.flows import Flow

stacked = [left, mid, right]
flow = Flow(segments=stacked, arrangement="vertical")
flow.show()
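
One way to picture what vertical stacking does: each segment keeps its own coordinates, and the flow adds an offset so the second segment starts where the first one ends. The sketch below is only a mental model with hypothetical helper names (`stack_offsets`, `to_flow_y`); it is not how `Flow` is actually implemented.

```python
def stack_offsets(heights):
    """Return the y-offset at which each segment starts once stacked vertically."""
    offsets = []
    running = 0
    for h in heights:
        offsets.append(running)
        running += h
    return offsets


def to_flow_y(segment_index, y_in_segment, heights):
    """Translate a y-coordinate inside one segment into the stacked coordinate system."""
    return stack_offsets(heights)[segment_index] + y_in_segment


# Three columns, each as tall as a 792-point page (illustrative numbers)
heights = [792, 792, 792]
print(stack_offsets(heights))      # [0, 792, 1584]
print(to_flow_y(1, 100, heights))  # 892
```

Something 100 points from the top of the middle column ends up "below" everything in the left column, which is why spatial comparisons keep making sense after stacking.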

Now any time we want to use spatial comparisons, like "find something below this," it *just works*.

In [None]:
region = (
    flow
    .find('text:contains("Table one")')
    .below(
        until='text:contains("Table two")',
        include_endpoint=False
    )
)
region.show()

It works for text, it works for tables, it works for **anything**. Let's see how we can get both tables on the page.

First we find the bold headers – we need to say `width > 10` because otherwise it pulls some weird tiny empty boxes.

In [None]:
(
    flow
    .find_all('text[width>10]:bold')
    .show()
)

Then we take each of those headers, and go down down down until we either hit another bold header *or* the "Here is a bit more text" text. 

In [None]:
regions = (
    flow
    .find_all('text[width>10]:bold')
    .below(
        until='text[width>10]:bold|text:contains("Here is a bit")',
        include_endpoint=False
    )
)
regions.show()
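
The logic behind "go down until you hit A *or* B" can be sketched with plain numbers: each header's section ends at whichever boundary comes first below it. This is a simplified model with made-up positions and a hypothetical `sections_between` helper, not the library's own code.

```python
def sections_between(header_positions, stop_positions, end):
    """For each header, return (start, stop) where stop is the nearest boundary below it."""
    boundaries = sorted(set(header_positions) | set(stop_positions))
    sections = []
    for start in sorted(header_positions):
        below = [b for b in boundaries if b > start]
        # If nothing lies below this header, the section runs to the end of the flow
        sections.append((start, below[0] if below else end))
    return sections


headers = [50, 400]  # y-positions of the bold headers (illustrative)
stops = [700]        # y-position of the extra stopping text (illustrative)
print(sections_between(headers, stops, end=1000))  # [(50, 400), (400, 700)]
```

The first section stops at the second header, and the second section stops at the extra text, which mirrors what the `|` in the `until` selector is doing.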


Now we can use `.extract_table()` on *each individual region* to give us a bunch of tables.

In [None]:
regions[1].extract_table().to_df()
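
To get every table at once instead of indexing one region at a time, a small loop is enough. This is a sketch that assumes the collection can be iterated (the cell above only shows indexing) and that each region's `.extract_table()` result has `.to_df()`, as used above; `tables_from_regions` is a hypothetical helper name.

```python
def tables_from_regions(regions):
    """Call .extract_table().to_df() on every region and collect the results."""
    frames = []
    for region in regions:
        table = region.extract_table()
        frames.append(table.to_df())
    return frames
```

In the notebook you would call it as `dfs = tables_from_regions(regions)` and then look at `dfs[0]`, `dfs[1]`, and so on.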

# Layout analysis and magic table extraction

Similar to how we have feelings about what things are on a page (headers, tables, graphics), computers also have opinions! Just like some AI models have been trained to identify pictures of cats and dogs or to spell check, others are capable of **layout analysis**: [YOLO](https://huggingface.co/spaces/omoured/YOLOv11-Document-Layout-Analysis), [surya](https://github.com/datalab-to/surya), and so on. There are a million! [TATR](https://github.com/microsoft/table-transformer) is one of the useful ones for us because it's *just for table detection*.

But honestly: they're mostly trained on academic papers, so they aren't very good at the kinds of awful documents that journalists have to deal with. And with Natural PDF, you're probably selecting `text[size>12]:bold` in order to find headlines, anyway. *But* if your page has no readable text, they might be able to provide some useful information.

Let's start with [YOLO](https://github.com/opendatalab/DocLayout-YOLO), the default.

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]

In [None]:
# default is YOLO
page.analyze_layout()
(
    page
    .find_all('region')
    .show(group_by='type', width=800)
)

In [None]:
page.find('table').apply_ocr()
text = page.extract_text()
print(text)

## Better layout analysis with tables

Let's see what **TATR**, Microsoft's table transformer, finds for us.

In [None]:
page.analyze_layout('tatr')
page.find_all('region').show(group_by='type', width=800)

TATR finds *so much stuff* that it's all overlapping.

It's easier to look at one piece at a time.

In [None]:
# Region types to try in the selector:
# table-cell, table-row, table-column
page.find_all('region[type=table-row]').expand(-2).show(crop=True)
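
When a layout model returns a pile of overlapping regions, a quick tally by type can tell you what you're dealing with before you start drawing boxes. This is a sketch only: `tally_region_types` is a hypothetical helper, and `region_type` is my assumption about the attribute name on each region, so check it against your installed version.

```python
from collections import Counter


def tally_region_types(regions, attr="region_type"):
    """Count regions by the value of `attr` (an assumed attribute name)."""
    # Regions without the attribute are counted under "unknown"
    return Counter(getattr(region, attr, "unknown") for region in regions)
```

In the notebook, `tally_region_types(page.find_all('region'))` would give you a count per type, assuming that attribute exists.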

In [None]:
# Grab all of the columns
cols = page.find_all('region[type=table-column]')

# Take one of the columns and apply OCR to it
cols[2].apply_ocr()
text = cols[2].extract_text()
print(text)

In [None]:
len(cols[2].find_all('text[source=ocr]'))

In [None]:
page.find('table').show()

In [None]:
data = page.find('table').extract_table()
data

## YOLO

In [None]:
# Run layout analysis again with the default model (YOLO)
page.analyze_layout()
page.find_all('region').show(group_by="type")

In [None]:
page.find("region[type=table]").apply_ocr()

In [None]:
text = page.extract_text()
print(text)

In [None]:
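from natural_pdf.analyzers.guides import Guides

table_area = page.find("region[type=table]")
guides = Guides(table_area)
# Horizontal guides: taken from lines detected in the table area
guides.horizontal.from_lines()
# Vertical guides: placed using these column header words
guides.vertical.from_content(["Description", "Level", "Repeat"])
# Shift the vertical guides into whitespace gaps so they don't cut through text
guides.vertical.snap_to_whitespace()
guides.show()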

In [None]:
guides.extract_table().to_df()
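
Conceptually, guides are just lists of x and y positions, and every pair of neighboring guides bounds one cell. Here is a minimal sketch of that idea with made-up coordinates; `cells_from_guides` is a hypothetical helper and not how `guides.extract_table()` works internally.

```python
def cells_from_guides(verticals, horizontals):
    """Build a grid of (x0, top, x1, bottom) boxes from guide positions."""
    xs = sorted(verticals)
    ys = sorted(horizontals)
    rows = []
    # Each pair of neighboring horizontal guides is a row,
    # each pair of neighboring vertical guides is a column
    for top, bottom in zip(ys, ys[1:]):
        rows.append([(x0, top, x1, bottom) for x0, x1 in zip(xs, xs[1:])])
    return rows


grid = cells_from_guides(verticals=[0, 100, 250], horizontals=[10, 30, 50])
print(len(grid), len(grid[0]))  # 2 2
print(grid[0][1])               # (100, 10, 250, 30)
```

Three vertical guides and three horizontal guides give a 2×2 grid, which is why a missing or misplaced guide usually shows up as merged or split columns in the extracted table.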